<HTML>
<HEAD>
<TITLE>examples  -  Example Programs using the SEQIO Package</TITLE>
<owner_name="James Knight, knight@cs.ucdavis.edu">
<LINK REV="made" HREF="mailto:knight@cs.ucdavis.edu">
</HEAD>

<BODY>

<I><A HREF="seqio.html">SEQIO -- A Package for Sequence File I/O</A></I>
<HR>

<P>
<H1>examples  -  Example Programs using the SEQIO Package</H1>

<HR>

<H2>Usage</H2>

<PRE>
   example1  keyword  file

   example2  files...

   example3  files...

   example4 [options] files...
      -a string    -  Match an author's last name
      -d string    -  Match a substring of definition
      -e string    -  Match a substring in the entry
      -g string    -  Match an element of geneaology
      -j string    -  Match an journal name
      -k string    -  Match a keyword
      -o string    -  Match the formal organism name
      -r string    -  Match a substring of reference title

   typeseq  files...

   wcseq  files...

   grepseq  [-# | -a | -d | -l | -m | -o file | -p]  pattern  files...
</PRE>

<P>
<H2>Descriptions (except for grepseq)</H2>

These are simple programs (except maybe for grepseq) that show the
various techniques for using the SEQIO package.  The `example1'
program is a simple version of the grepseq program which takes a
keyword string and a single file/database specifier and searches the
input sequences for any exact match to the keyword string.  The entry
of any matching sequence is output.

<P>
The `example2' program is a sequence information display program.  It
displays all of the information the SEQIO package can automatically
extract from each input entry.

<P>
The `example3' program shows how to extract information from the EMBL
feature tables using the SEQIO package.  Specifically, the program
outputs the "note" subfield of every "CDS_pept" feature of the EMBL
files given to it as input.  Since the example was not meant as a
general feature extraction program (although it should show you how to
create one if you want), the feature name, subfield, and file format
were not generalized but hard-coded into the program.

<P>
The `example4' program is an entry filtering program for GenBank
formatted entries.  Essentially, a primitive form of the type of
filtering done by SRS.  In addition to the files/databases (which must
be in GenBank format), options can be given to tell the program only
to output entries containing the appropriate information (see the
option list above and the program comments for the specifics about
that information).  The program will output all input entries whose
information matches all of the specified options (i.e., AND'ing all of
the option filters).  The matching for the individual options is
case-insensitive.

<P>
The program `typeseq' is essentially a combination of a `cat' program
for sequences and a simple `fetch' program (like GCG's FETCH).  All
the program actually does is output its input entries.  But, because
the SEQIO package has the capability to access single entries of files
and databases, the "fetch" capability is automatically included in
this simple program.

<P>
The program `wcseq' counts the number of sequences, number of entries,
and total size of the sequences in the input (like `wc' counts lines,
words and characters).  The code to this program shows how a
programmer can more closely control how files and databases are read,
instead of leaving all of the database translation and reading to the
SEQIO package.

<P>
As I said, these programs were written mostly to show programmers how
to use the SEQIO package, but the typeseq, wcseq and the grepseq
program described below may also be useful utility programs.

<P>
<H2>Description of grepseq</H2>

The `grepseq' program takes a keyword which can contain ambiguous
characters and character classes (also called a fixed-width motif) and
then searches files and databases for exact or approximate matches to
that keyword.  The program produces one of two kinds of output, either
a list of the matching sequences with the places where the keyword
matched, or the complete entries of sequences containing matches,
where each entry is annotated with the places where the matches occur.

<P>
<H2>Program Options</H2>

The command line should consist of the keyword, one or more
files/databases, and any program options.  The first command
line argument which does not begin with a '-' is taken as the keyword,
and the rest not beginning with '-' are the files/databases.  The
program options should be specified individually on the command line.
So, "-d -l -3" is valid, whereas "-dl3" is not.
<PRE>
The program options:
  -#  -  The number of errors to allow.
  -a  -  Ambiguous characters of the text should used in the matching.
  -d  -  The text will be DNA or RNA.
  -l  -  Only list the sequences that match.  Don't output their entries.
  -m  -  Only allow mismatch errors.  No indels.
  -o filename - The output file.
  -p  -  The text will be Protein.

Examples:
   grepseq ACGYNNCGT mydna1 mydna2 mydna3
   grepseq -2 -m -l TDSSYDKER swissprot
   grepseq -a -1 [LIVY]G[LIVMFYAG][DK]S[LIVMT]...[DET]....[LIVMF] pir
</PRE>

The keyword format uses the same syntax as the grep family of programs,
except that it only allows character classes of the form "[...]" and
"[^...]" (where the '^' at the beginning of the class specifies that
the character class is the complement of the given characters).  Like
the grep family, a period '.' always matches anything, and the
backslash character is used to specify periods, brackets and
backslashes in the keyword.

<P>
(NOTE:  Only keywords whose width is 31 or smaller can be handled by
the program.  The string itself can be longer since it can contain
character classes specifying the characters for one position, but the
number of positions specified by the keyword cannot be longer than
31.)

<P>
For the input files/databases, if a command line argument specifies an
existing file, then that file is taken as input.  Otherwise, the
argument is treated as a database search specification.  In order for
databases to be searched, a BIOSEQ file must be created to describe
all of your databases, and the environment variable "BIOSEQ" must be
set to that file.  Then, a database search is specified by giving (1)
the name of the database to search the whole database, as in "genbank"
or "swiss-prot", (2) the database name, a colon and then a list of
files and aliases to search a part of the database, as in
"pir:pir1.dat,pir2.dat" or "genbank:phg,inv", or (3) a database name
followed by a `suffix alias', as in "pir1" or "gbphg" which searches
the files defined by the suffix alias.  See file "user.doc" for
complete information on <A HREF="seqio_user.html#bioseq">creating
BIOSEQ files</A> and <A HREF="seqio_user.html#dbsearch">specifying
database searches</A>. 

<P>
<H2>Program Execution</H2>

The program preprocesses the keyword under all three alphabets,
DNA/RNA, Protein and ASCII.  For DNA/RNA and Protein, the standard
IUPAC alphabet characters are used in the preprocessing under those
alphabets.  Then, for each input sequence, once the alphabet has been
determined, the correspondingly processed keyword is used for the
matching.  If no alphabet is specified by the program options, by the
BIOSEQ entry or by the sequence entry, the alphabet is guessed at.  It
is assumed to be DNA/RNA if all of the characters are in the alphabet
and at least 85% of them are A, C, G or T/U.  Else, if all the
characters are Protein characters, the alphabet is assumed to be
Protein.  Otherwise, the alphabet is assumed to be ASCII.

<P>
The program then searches each input sequence to find substrings that
match the keyword either exactly, by default, or within the specified
number of mismatches, insertions and deletions (or just mismatches if
`-m' is specified).  By default, the program will ignore any ambiguous
characters in the input sequences.  This avoids the problem of any
keyword always matching a string of N's in DNA/RNA or X's in Protein.
If the `-a' option is specified, then those ambiguous characters will
be included in the matching, and ambiguous characters are considered a
match if the intersection of the character classes they represent is
non-empty.

<P>
If the `-l' option is specified, the output will consists of a list of
one-line descriptions of each matching sequence, along with the places
in the sequence where the matches occur.  The first 6 non-overlapping
matches will be reported for each such sequence (the best scoring
match of any overlapping matches is reported).  If `-l' is not
specified, then the entries of matching sequences are output, where
the comment section of each entry is annotated with a description of
where the matches occurred in that sequence.

<P>
When complete entries are being output, the format for all output is
the format of the first input file.  So, if the input contains
files/databases with different formats, some of the entries may be
converted to the format of the first input file/database in order for
the output to be a single file format.

<P>
(NOTE:  When insertions and deletions are allowed, the reported left
endpoint of a match will not necessarily be the actual left endpoint
of the match.  The reported left endpoint is always computed by
subtracting the keyword size from the right endpoint and adding 1.
The algorithm which performs the matching cannot keep track of the
left endpoints without a significant decrease in running time.  The
reported right endpoint, however, will always be the actual right
endpoint of the match.)


<P>
<HR>
<ADDRESS> 
<a href="http://wwwcsif.cs.ucdavis.edu/~knight">James R. Knight,</a>
<a href="mailto:knight@cs.ucdavis.edu">knight@cs.ucdavis.edu</a><BR>
July 9, 1996
</ADDRESS>
</BODY>
